In this work we show that, using the eigen-decomposition of the adjacency matrix, we can consistently estimate feature maps for latent position graphs with positive definite link function $\kappa$, provided that the latent positions are i.i.d. from some distribution $F$. We then consider the exploitation task of vertex classification where the link function $\kappa$ belongs to the class of universal kernels, class labels are observed for a number of vertices tending to infinity, and the remaining vertices are to be classified. We show that minimization of the empirical $\varphi$-risk for some convex surrogate $\varphi$ of 0-1 loss over a class of linear classifiers with increasing complexities yields a universally consistent classifier, i.e., a classification rule with error converging to Bayes optimal for any distribution $F$.



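1. Introduction. The classical statistical pattern recognition setting involves

$(X, Y), (X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} F_{\mathcal{X},\mathcal{Y}}$

where the $X_i \in \mathcal{X} \subset \mathbb{R}^d$ are observed feature vectors and the $Y_i \in \mathcal{Y} = \{-1, 1\}$ are observed class labels, for some probability distribution $F_{\mathcal{X},\mathcal{Y}}$ on $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{n}$. A classifier $h(\cdot\,;\mathcal{D}): \mathcal{X} \to \{-1, 1\}$ whose probability of error $P[h(X;\mathcal{D}) \neq Y \mid \mathcal{D}]$ approaches Bayes-optimal as $n \to \infty$ for all distributions $F_{\mathcal{X},\mathcal{Y}}$ is said to be universally consistent. For example, the $k$-NN classifier with $k \to \infty$, $k/n \to 0$ is universally consistent [28].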

In this paper, we consider the case wherein the feature vectors are unobserved, and we observe instead a latent position graph $G = G(X, X_1, \dots, X_n)$ on $n + 1$ vertices with link function $\kappa: \mathcal{X} \times \mathcal{X} \to [0, 1]$. The graph $G$ is constructed such that there is a one-to-one relationship between the vertices of $G$ and the feature vectors $X, X_1, \dots, X_n$, and the edges of $G$ are conditionally independent Bernoulli random variables given the latent $X, X_1, \dots, X_n$. We show that there exists a universally consistent classification rule for this extension of the classical pattern recognition setup to latent position graph models, provided that the link function $\kappa$ is an element of the class of universal kernels.

The above setting of classification for latent position graphs, with $\kappa$ being the inner product in $\mathbb{R}^d$, was previously considered in [30]. It was shown that the eigen-decomposition of the adjacency matrix $\mathbf{A}$ yields a consistent estimator, up to some orthogonal transformation, of the latent vectors $X, X_1, \dots, X_n$. Therefore, the $k$-NN classifier, using the estimated vectors, with $k \to \infty$, $k/n \to 0$ is universally consistent. When $\kappa$ is a general, possibly unknown, link function, we cannot expect to recover the latent vectors. However, we can obtain a consistent estimator of some feature map $\Phi: \mathcal{X} \to \mathcal{H}$ of $\kappa$. Classifiers that use only the feature map $\Phi$ are universally consistent if the space $\mathcal{H}$ is isomorphic to some dense subspace of the space of measurable functions on $\mathcal{X}$. The notion of a universal kernel [19, 24, 25] characterizes those $\kappa$ whose feature maps $\Phi$ induce a dense subspace of the space of measurable functions on $\mathcal{X}$.

The structure of our paper is as follows. We introduce the framework of latent position graphs in § 2. In § 3, we show that the eigen-decomposition of the adjacency matrix $\mathbf{A}$ yields a consistent estimator for a feature map $\Phi: \mathcal{X} \to l_2$ of $\kappa$. We discuss the notion of universal kernels and the problem of vertex classification using the estimates of the feature map $\Phi$ in § 4. In particular, we show that the classification rule obtained by minimizing a convex surrogate of the 0-1 loss over a class of linear classifiers in $\mathbb{R}^d$ is universally consistent, provided that $d \to \infty$ in a specified manner. We conclude the paper with a discussion of how some of the results presented herein can be extended, and of other implications.

We make a brief comment on the setup of the paper. The main contribution of the paper is the derivation of the estimated feature maps and their use in constructing a universally consistent vertex classifier. We have thus considered a less general setup of compact metric spaces, linear classifiers, and convex, differentiable loss functions. Extending the results herein to a more general setup where the latent positions are elements of a (non-compact) metric space, the class of classifiers is uniformly locally Lipschitz, and the convex loss function satisfies the classification-calibrated property [1] is straightforward.

2. Framework. Let $(\mathcal{X}, d)$ be a compact metric space and $F$ a probability measure on the Borel $\sigma$-field of $\mathcal{X}$. Let $\kappa: \mathcal{X} \times \mathcal{X} \to [0, 1]$ be a continuous, positive definite kernel. Let $L^2(\mathcal{X}, F)$ be the space of square-integrable functions with respect to $F$. We can define an integral operator $\mathcal{K}: L^2(\mathcal{X}, F) \to L^2(\mathcal{X}, F)$ by

$(\mathcal{K} g)(x) = \int_{\mathcal{X}} \kappa(x, x')\, g(x')\, dF(x').$

$\mathcal{K}$ is a compact operator and is of trace class.

Let $\{\lambda_j\}$ be the set of eigenvalues of $\mathcal{K}$ ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. Let $\{\psi_j\}$ be a set of orthonormal eigenfunctions of $\mathcal{K}$ corresponding to the $\{\lambda_j\}$, i.e.,

$\mathcal{K}\psi_j = \lambda_j \psi_j, \qquad \int_{\mathcal{X}} \psi_i(x)\,\psi_j(x)\, dF(x) = \delta_{ij}.$

The following Mercer's representation theorem [9, 27] provides a representation for $\kappa$ in terms of the eigenvalues and eigenfunctions of $\mathcal{K}$ defined above.

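Theorem 2.1. Assume the above settings. Then we have

(2.1)  $\kappa(x, x') = \sum_{j=1}^{\infty} \lambda_j \psi_j(x) \psi_j(x').$

The sum in Eq. (2.1) converges absolutely for each $x$ and $x'$ in $\mathrm{supp}(F) \times \mathrm{supp}(F)$ and uniformly on $\mathrm{supp}(F) \times \mathrm{supp}(F)$. Let $\mathcal{H}$ be a Hilbert space whose elements are of the form

(2.2)  $\eta = \sum_{j} a_j \sqrt{\lambda_j}\, \psi_j, \qquad (a_j) \in l_2,$

with inner product

(2.3)  $\Big\langle \sum_{j} a_j \sqrt{\lambda_j}\, \psi_j,\ \sum_{j} b_j \sqrt{\lambda_j}\, \psi_j \Big\rangle_{\mathcal{H}} = \sum_{j} a_j b_j.$

Then $\mathcal{H}$ is the reproducing kernel Hilbert space for $\kappa$.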

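By Mercer's representation theorem, we have $\kappa(\cdot, x) = \sum_{j} \sqrt{\lambda_j}\, \psi_j(x)\, \sqrt{\lambda_j}\, \psi_j(\cdot)$. We thus define the feature map $\Phi: \mathcal{X} \to l_2$ by

$\Phi(x) = \big(\sqrt{\lambda_j}\, \psi_j(x) : j = 1, 2, \dots\big).$

Let $d$ be an integer with $d \ge 1$. We also define the following map $\Phi_d: \mathcal{X} \to \mathbb{R}^d$

$\Phi_d(x) = \big(\sqrt{\lambda_j}\, \psi_j(x) : j = 1, 2, \dots, d\big).$

We will refer to $\Phi_d$ as the truncation of $\Phi$ to $\mathbb{R}^d$.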

Now, for a given $n$, let $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} F$. Define $\mathbf{K} = (\kappa(X_i, X_j))_{i,j=1}^{n}$. Let $\mathbf{A}$ be a symmetric random hollow matrix where the entries $\{\mathbf{A}_{ij}\}_{i<j}$ are conditionally independent Bernoulli random variables with $P[\mathbf{A}_{ij} = 1] = \mathbf{K}_{ij}$ for all $i, j \in [n]$, $i < j$. $\mathbf{A}$ is the adjacency matrix corresponding to a graph with vertex set $\{1, 2, \dots, n\}$. A graph $G$ whose adjacency matrix $\mathbf{A}$ is constructed as above is an instance of a latent position graph [14] where the latent positions are sampled according to $F$ and the link function is $\kappa$.
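The construction above can be illustrated with a minimal sketch. The choices here are assumptions made only for illustration and are not taken from the paper: $F$ is taken to be Uniform[0, 1], the link function is a Gaussian kernel, and the function names are hypothetical.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Illustrative link function with values in (0, 1] (an assumed choice)."""
    return math.exp(-((x - y) ** 2) / sigma ** 2)

def sample_latent_position_graph(n, kernel, rng):
    """Return latent positions X, kernel matrix K and adjacency matrix A."""
    X = [rng.random() for _ in range(n)]  # X_i i.i.d. from F = Uniform[0, 1] (assumed)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    A = [[0] * n for _ in range(n)]  # hollow: the diagonal stays zero
    for i in range(n):
        for j in range(i + 1, n):
            # A_ij ~ Bernoulli(K_ij), independently given the latent positions
            edge = 1 if rng.random() < K[i][j] else 0
            A[i][j] = A[j][i] = edge  # symmetric
    return X, K, A

X, K, A = sample_latent_position_graph(50, gaussian_kernel, random.Random(0))
```

Only `A` would be observed in the setting of this paper; `X` and `K` are returned here purely so that the sketch can be inspected.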
2.1. Related Work. The eigen-decomposition of adjacency matrices is related to the notion of spectral clustering. See [32] and the references therein for a survey of spectral clustering techniques. Of particular relevance to our work are investigations on the theoretical properties of spectral clustering (and the related notion of kernel PCA) for latent position graphs, which we now mention.

There are two main sources of randomness in latent position graphs. The first source of randomness is due to the sampling procedure and the second source of randomness is due to the conditionally independent Bernoulli trials that give rise to the edges of the graphs. The randomness in the sampling procedure and its effects on spectral clustering and/or kernel PCA have been widely studied. In the manifold learning literature, the latent positions are sampled from some manifold in Euclidean space and [2, 12, 13] among others studied the convergence of the various graph Laplacian matrices to their corresponding Laplace-Beltrami operators on the manifold. [22, 33] studied the convergence of the eigenvalues and eigenvectors of the graph Laplacian to the eigenvalues and eigenfunctions of the corresponding operators in the spectral clustering setting.

The matrix $\mathbf{K}/n$ can be considered as an approximation of $\mathcal{K}$ for large $n$, that is, we expect the eigenvalues and eigenvectors of $\mathbf{K}/n$ to converge to the eigenvalues and eigenfunctions of $\mathcal{K}$ in some sense. This convergence is important in understanding the theoretical properties of kernel PCA, see e.g., [3, 8, 17, 22, 23, 36]. We summarize some of the results from this body of work that directly pertain to the current paper in § B of the appendix. We note that the above papers do not consider the vertex classification problem directly and their results are mainly on the rate of convergence of the eigenvalues and eigenvectors or the projection error when projecting onto the subspace spanned by the eigenvectors of the kernel matrix.

The Bernoulli trials at each edge and their effects have also been studied. We mention three examples. A popular latent position graph model is the stochastic blockmodel of [15]. In a stochastic blockmodel graph, each vertex is associated with one of $b$ possible blocks, and the edges are conditionally independent Bernoulli random variables given the block memberships of the vertices. Furthermore, the probability of an edge between two vertices is determined solely from their block memberships. [21] and [29] showed that spectral clustering on the normalized Laplacian and adjacency matrices, respectively, yields consistent estimates for the block memberships. Another latent position graph model is the random dot product graph model [34], where the latent positions are in the positive orthant of $\mathbb{R}^d$ and the link function is the usual Euclidean inner product. [30] considered the vertex classification problem for these random dot product graphs and showed that the $k$-NN algorithm using the spectral embeddings of the adjacency matrix yields a universally consistent classifier. [20] studied the convergence of the eigenvalues and eigenvectors of the adjacency matrix $\mathbf{A}$ to those of the integral operator $\mathcal{K}$ for the class of inhomogeneous random graphs. The inhomogeneous random graphs in [20] have latent positions that are uniform $[0, 1]$ random variables (the link function $\kappa$ can be an arbitrary symmetric function). For more on inhomogeneous random graphs the reader may refer to [7].

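3. Estimation of Feature Maps. We assume the setting of § 2. Let us denote by $\mathcal{M}_d(\mathbb{R})$ and $\mathcal{M}_{d,n}(\mathbb{R})$ the set of $d \times d$ matrices and $d \times n$ matrices on $\mathbb{R}$, respectively. Let $\tilde{\mathbf{U}}_A \tilde{\mathbf{S}}_A \tilde{\mathbf{U}}_A^{T}$ be the eigen-decomposition of $\mathbf{A}$. For a given $d \ge 1$, let $\mathbf{S}_A \in \mathcal{M}_d(\mathbb{R})$ be the diagonal matrix comprising the $d$ largest eigenvalues of $\mathbf{A}$ and let $\mathbf{U}_A \in \mathcal{M}_{n,d}(\mathbb{R})$ be the matrix comprising the corresponding eigenvectors. The matrices $\mathbf{S}_K$ and $\mathbf{U}_K$ are defined similarly. For a matrix $\mathbf{M}$, $\|\mathbf{M}\|$ refers to the spectral norm of $\mathbf{M}$ while $\|\mathbf{M}\|_F$ refers to the Frobenius norm of $\mathbf{M}$. For a vector $v \in \mathbb{R}^n$, $\|v\|$ will denote the Euclidean norm of $v$.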
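As a minimal sketch of the notation above, the following computes the $d$ largest eigenvalues and corresponding eigenvectors of a symmetric matrix, the scaled embedding $\mathbf{U}\mathbf{S}^{1/2}$, and the two matrix norms. The uniform latent positions, the Gaussian link function and the helper names `top_eigs` and `embed` are assumptions made for illustration only.

```python
import numpy as np

def top_eigs(M, d):
    """Return the d largest eigenvalues of symmetric M and their eigenvectors."""
    vals, vecs = np.linalg.eigh(M)          # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]      # indices of the d largest eigenvalues
    return vals[order], vecs[:, order]

def embed(M, d):
    """Rows of U S^{1/2}, the d-dimensional spectral embedding of M."""
    S, U = top_eigs(M, d)
    return U * np.sqrt(np.clip(S, 0.0, None))   # guard against tiny negative values

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.uniform(size=n)                          # assumed F = Uniform[0, 1]
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # assumed Gaussian link function
upper = np.triu(rng.uniform(size=(n, n)) < K, k=1)
A = (upper | upper.T).astype(float)              # symmetric, hollow adjacency matrix

Z_A, Z_K = embed(A, d), embed(K, d)
spectral = np.linalg.norm(A - K, 2)              # spectral norm ||A - K||
frobenius = np.linalg.norm(A - K, "fro")         # Frobenius norm ||A - K||_F
```

The embeddings `Z_A` and `Z_K` are determined only up to an orthogonal transformation (for instance, each eigenvector may change sign), so they should not be compared entry by entry.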

The key result of this section is the following theorem which shows that, given that there is a gap in the spectrum of $\mathcal{K}$ at $\lambda_d(\mathcal{K})$, by using the eigen-decomposition of $\mathbf{A}$ we can accurately estimate the truncated map $\Phi_d$ in § 2 up to an orthogonal transformation.

Theorem 3.1. Let $d \ge 1$ be given. Denote by $\delta_d$ the quantity $\lambda_d(\mathcal{K}) - \lambda_{d+1}(\mathcal{K})$ and suppose that $\delta_d > 0$. Then with probability greater than $1 - 2\eta$, there exists a unitary matrix $\mathbf{W} \in \mathcal{M}_d(\mathbb{R})$ such that

(3.1)  $\|\mathbf{U}_A \mathbf{S}_A^{1/2} \mathbf{W} - \mathbf{\Phi}_d\|_F \le 27\, \delta_d^{-2} \sqrt{d \log(n/\eta)}$

where $\mathbf{\Phi}_d$ denotes the matrix in $\mathcal{M}_{n,d}(\mathbb{R})$ whose $i$-th row is $\Phi_d(X_i)$. Let us denote by $\hat{\Phi}_d(X_i)$ the $i$-th row of $\mathbf{U}_A \mathbf{S}_A^{1/2} \mathbf{W}$. Then, for each $i \in [n]$ and any $\epsilon > 0$,

(3.2)  $P\big[\|\hat{\Phi}_d(X_i) - \Phi_d(X_i)\| > \epsilon\big] \le 27\, \delta_d^{-2}\, \epsilon^{-1} \sqrt{\frac{6 d \log n}{n}}.$

We now proceed to prove Theorem 3.1. A rough sketch of the argument goes as follows. First we will show that the projection of $\mathbf{A}$ onto the subspace spanned by $\mathbf{U}_A$ is "close" to the projection of $\mathbf{K}$ onto the subspace spanned by $\mathbf{U}_K$. Then we will use results on the convergence of spectra of $\mathbf{K}$ to the spectra of $\mathcal{K}$ to show that the subspace spanned by $\mathbf{U}_A$ is also "close" to the subspace spanned by $\mathbf{\Phi}_d$.

We need the following bound for the perturbation $\mathbf{A} - \mathbf{K}$ from [20]. The convergence of the spectra of $\mathbf{A}$ to that of $\mathcal{K}$ as given by Theorem 6.1 in [20] is similar to that given in the proof of Theorem 3.1 in the current paper, but there are sufficient differences between the two settings and we do not see an obvious way to apply the conclusions of Theorem 6.1 in [20] to the current paper.







TANG, SUSSMAN AND PRIEBE 



Proposition 3.2. For $\mathbf{A}$ and $\mathbf{K}$ as defined above, with probability at least $1 - \eta$ we have

(3.3)  $\|\mathbf{A} - \mathbf{K}\| \le 2\sqrt{n \log(n/\eta)}.$

The constant in Eq. (3.3) was obtained by replacing a concentration inequality in [20] with a slightly stronger inequality from [31]. We now show that the projection matrix for the subspace spanned by $\mathbf{U}_A$ is close to the projection matrix for the subspace spanned by $\mathbf{U}_K$.
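The bound in Eq. (3.3) can be checked numerically on a single simulated graph. This is only a sanity check under assumed choices (uniform latent positions, a Gaussian link function, $n = 300$, $\eta = 0.05$), not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta = 300, 0.05
X = rng.uniform(size=n)                          # assumed F = Uniform[0, 1]
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # assumed Gaussian link function
upper = np.triu(rng.uniform(size=(n, n)) < K, k=1)
A = (upper | upper.T).astype(float)              # symmetric, hollow adjacency matrix

spectral_norm = np.linalg.norm(A - K, 2)         # ||A - K||
bound = 2.0 * np.sqrt(n * np.log(n / eta))       # right-hand side of Eq. (3.3)
print(f"||A - K|| = {spectral_norm:.2f}, bound = {bound:.2f}")
```

In simulations of this kind the observed spectral norm is typically well below the bound, which is stated for all link functions with values in $[0, 1]$.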

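Proposition 3.3. Let $\mathcal{P}_A = \mathbf{U}_A \mathbf{U}_A^{T}$ and $\mathcal{P}_K = \mathbf{U}_K \mathbf{U}_K^{T}$. Denote by $\delta_d$ the quantity $\lambda_d(\mathcal{K}) - \lambda_{d+1}(\mathcal{K})$ and suppose that $\delta_d > 0$. If $n$ is such that $\delta_d \ge 8(1 + \sqrt{2})\, n^{-1/2} \sqrt{\log(n/\eta)}$, then with probability at least $1 - 2\eta$,

(3.4)  $\|\mathcal{P}_A - \mathcal{P}_K\| \le 4 \sqrt{\frac{\log(n/\eta)}{n\, \delta_d^{2}}}.$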


Proof. By Eq. (B.4) in Theorem B.2, we have with probability at least $1 - \eta$,

$\lambda_d(\mathbf{K}/n) - \lambda_{d+1}(\mathbf{K}/n) \ge \delta_d - 4\sqrt{2} \sqrt{\frac{\log(2/\eta)}{n}}.$

Now, let $S_1$ and $S_2$ be defined as

$S_1 = \{\lambda : \lambda \ge n \lambda_d(\mathbf{K}/n) - 2\sqrt{n \log(n/\eta)}\}$
$S_2 = \{\lambda : \lambda \le n \lambda_{d+1}(\mathbf{K}/n) + 2\sqrt{n \log(n/\eta)}\}$

Then we have, with probability at least $1 - \eta$,

(3.5)  $\mathrm{dist}(S_1, S_2) \ge n\delta_d - 4\sqrt{2}\sqrt{n \log(2/\eta)} - 4\sqrt{n \log(n/\eta)} \ge n\delta_d - 4(1 + \sqrt{2})\sqrt{n \log(n/\eta)}.$

Suppose for the moment that $S_1$ and $S_2$ are disjoint, i.e., that $\mathrm{dist}(S_1, S_2) > 0$. Let $\mathcal{P}_A(S_1)$ be the matrix for the orthogonal projection onto the subspace spanned by the eigenvectors of $\mathbf{A}$ whose corresponding eigenvalues lie in $S_1$. Let $\mathcal{P}_K(S_1)$ be defined similarly. Then by the $\sin \Theta$ theorem [10] we have

$\|\mathcal{P}_A(S_1) - \mathcal{P}_K(S_1)\| \le \frac{\|\mathbf{A} - \mathbf{K}\|}{\mathrm{dist}(S_1, S_2)}.$

By Eq. (3.5) and Proposition 3.2, we have, with probability at least $1 - 2\eta$,

$\|\mathcal{P}_A(S_1) - \mathcal{P}_K(S_1)\| \le \frac{2\sqrt{n \log(n/\eta)}}{n\delta_d - 4(1 + \sqrt{2})\sqrt{n \log(n/\eta)}} \le 4\sqrt{\frac{\log(n/\eta)}{n\, \delta_d^{2}}}$

provided that $4(1 + \sqrt{2})\sqrt{n \log(n/\eta)} \le n\delta_d/2$.

To complete the proof, we note that if $4(1 + \sqrt{2})\sqrt{n \log(n/\eta)} \le n\delta_d/2$, then $S_1$ and $S_2$ are disjoint. Thus $\mathcal{P}_K(S_1) = \mathbf{U}_K \mathbf{U}_K^{T}$. Finally, if $\|\mathbf{A} - \mathbf{K}\| \le 2\sqrt{n \log(n/\eta)}$, then the eigenvalues of $\mathbf{A}$ that lie in $S_1$ are exactly the $d$ largest eigenvalues of $\mathbf{A}$ and $\mathcal{P}_A(S_1) = \mathbf{U}_A \mathbf{U}_A^{T}$. Eq. (3.4) is thus established. □
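The projection comparison in the proof can be illustrated numerically. The sketch below uses one standard form of the Davis-Kahan $\sin\Theta$ bound, in which the separation is measured between $\lambda_d(\mathbf{A})$ and $\lambda_{d+1}(\mathbf{K})$; this is a simplification of the sets $S_1$ and $S_2$ used above, and the latent distribution, link function and helper name `top_eigs` are assumptions for illustration.

```python
import numpy as np

def top_eigs(M, d):
    """Return the d largest eigenvalues of symmetric M and their eigenvectors."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1][:d]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(2)
n, d = 300, 2
X = rng.uniform(size=n)                          # assumed F = Uniform[0, 1]
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # assumed Gaussian link function
upper = np.triu(rng.uniform(size=(n, n)) < K, k=1)
A = (upper | upper.T).astype(float)

S_A, U_A = top_eigs(A, d)
_, U_K = top_eigs(K, d)
P_A = U_A @ U_A.T                                # projection onto span of U_A
P_K = U_K @ U_K.T                                # projection onto span of U_K

eig_K = np.sort(np.linalg.eigvalsh(K))[::-1]     # eigenvalues of K, descending
separation = S_A[d - 1] - eig_K[d]               # lambda_d(A) - lambda_{d+1}(K)
lhs = np.linalg.norm(P_A - P_K, 2)
rhs = np.linalg.norm(A - K, 2) / separation      # meaningful only if separation > 0
```

Comparing projection matrices rather than eigenvector matrices avoids the sign and rotation ambiguity of the eigenvectors themselves.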

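The next result states that the rows of $\mathbf{U}_K \mathbf{S}_K^{1/2}$ correspond to the projection of $\Phi$ onto the $d$-dimensional subspace spanned by the eigenfunctions associated with the $d$ largest eigenvalues of $\mathcal{K}_{\mathcal{H},n}$, where $\mathcal{K}_{\mathcal{H},n}: \mathcal{H} \to \mathcal{H}$ is a linear operator on $\mathcal{H}$ induced by $\kappa$ and the $X_1, X_2, \dots, X_n$. The eigenfunctions of $\mathcal{K}_{\mathcal{H},n}$ correspond to extensions of the eigenvectors of $\mathbf{K}$ (in $\mathbb{R}^n$) to $\mathcal{H}$. See § B of the appendix for more details.

Lemma 3.4. Let $\hat{\mathcal{P}}_d$ be as defined in Theorem B.2. The rows of $\mathbf{U}_K \mathbf{S}_K^{1/2}$ then correspond, up to some orthogonal transformation, to projections of the feature map $\Phi$ onto $\mathbb{R}^d$ via $\hat{\mathcal{P}}_d$, i.e., there exists a unitary matrix $\mathbf{W} \in \mathcal{M}_d(\mathbb{R})$ such that

(3.6)  $\mathbf{U}_K \mathbf{S}_K^{1/2} \mathbf{W} = \big[\iota(\hat{\mathcal{P}}_d(\Phi(X_1)))^{T} \mid \cdots \mid \iota(\hat{\mathcal{P}}_d(\Phi(X_n)))^{T}\big]^{T}$

where $\iota$ is the isometric isomorphism of a finite-dimensional Hilbert space onto $\mathbb{R}^d$.

The proof of Lemma 3.4 is given in the appendix.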

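Proof of Theorem 3.1. We first note that the sum of any row of $\mathbf{A}$ is bounded from above by $n$, thus $\|\mathbf{A}\| \le n$. Similarly, $\|\mathbf{K}\| \le n$. On combining Eq. (3.4) and Eq. (3.3), we have, with probability at least $1 - 2\eta$,

$\|\mathcal{P}_A \mathbf{A} - \mathcal{P}_K \mathbf{K}\| \le \|\mathcal{P}_A(\mathbf{A} - \mathbf{K})\| + \|(\mathcal{P}_A - \mathcal{P}_K)\mathbf{K}\|$
$\qquad \le 2\sqrt{n \log(n/\eta)} + 4\,\delta_d^{-1}\sqrt{n \log(n/\eta)}$
$\qquad \le 6\,\delta_d^{-1}\sqrt{n \log(n/\eta)}.$

By Lemma A.1 in the appendix, there exists an orthogonal $\mathbf{W} \in \mathcal{M}_d(\mathbb{R})$ such that

$\|\mathbf{U}_A \mathbf{S}_A^{1/2} \mathbf{W} - \mathbf{U}_K \mathbf{S}_K^{1/2}\|_F \le 6\,\delta_d^{-1}\sqrt{n \log(n/\eta)}\; \frac{\sqrt{d\,\|\mathcal{P}_A \mathbf{A}\|} + \sqrt{d\,\|\mathcal{P}_K \mathbf{K}\|}}{\lambda_d(\mathbf{K})} \le 12\,\delta_d^{-1}\, \frac{n\sqrt{d \log(n/\eta)}}{\lambda_d(\mathbf{K})}.$

We note that $\lambda_d(\mathbf{K}) \ge n\lambda_d(\mathcal{K})/2$ provided that $n$ satisfies $\lambda_d(\mathcal{K}) \ge 4\sqrt{2}\sqrt{n^{-1}\log(n/\eta)}$. Thus, we have

(3.7)  $\|\mathbf{U}_A \mathbf{S}_A^{1/2} \mathbf{W} - \mathbf{U}_K \mathbf{S}_K^{1/2}\|_F \le 24\,\delta_d^{-1}\, \frac{\sqrt{d \log(n/\eta)}}{\lambda_d(\mathcal{K})} \le 24\,\delta_d^{-2}\sqrt{d \log(n/\eta)}$

with probability at least $1 - 2\eta$.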

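Now, by Lemma 3.4, the rows of $\mathbf{U}_K \mathbf{S}_K^{1/2}$ are (up to some orthogonal transformation) the projections of the feature map $\Phi$ onto $\mathbb{R}^d$ via $\hat{\mathcal{P}}_d$. On the other hand, $\Phi_d(X)$ is the projection of $\kappa(\cdot, X)$ onto $\mathbb{R}^d$ via $\mathcal{P}_d$. By Theorem B.2 in the appendix, for all $X$, we have

$\|\hat{\mathcal{P}}_d\, \kappa(\cdot, X) - \mathcal{P}_d\, \kappa(\cdot, X)\|_{\mathcal{H}} \le 2\sqrt{2}\, \frac{\sqrt{\log(1/\eta)}}{\delta_d \sqrt{n}}$

with probability at least $1 - 2\eta$. We therefore have, for some orthogonal $\tilde{\mathbf{W}} \in \mathcal{M}_d(\mathbb{R})$,

(3.8)  $\|\mathbf{U}_K \mathbf{S}_K^{1/2} \tilde{\mathbf{W}} - \mathbf{\Phi}_d\|_F \le 2\sqrt{2}\, \frac{\sqrt{\log(1/\eta)}}{\delta_d}$

with probability at least $1 - 2\eta$. Eq. (3.1) in the statement of the theorem then follows from Eq. (3.7) and Eq. (3.8).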

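To show Eq. (3.2), we first note that as the $\{X_i\}_{i=1}^{n}$ are independent and identically distributed, the $\{\hat{\Phi}_d(X_i)\}_{i=1}^{n}$ are exchangeable and hence identically distributed. Let $\eta = n^{-2}$. By conditioning on the event in Eq. (3.1), we have

(3.9)  $E\big[\|\hat{\Phi}_d(X_i) - \Phi_d(X_i)\|\big] \le \sqrt{E\big[\|\hat{\Phi}_d(X_i) - \Phi_d(X_i)\|^{2}\big]} \le \sqrt{\tfrac{1}{n} E\big[\|\hat{\mathbf{\Phi}}_d - \mathbf{\Phi}_d\|_F^{2}\big]} \le 27\,\delta_d^{-2}\sqrt{\frac{6 d \log n}{n}}$

because the worst case bound is $\|\hat{\mathbf{\Phi}}_d - \mathbf{\Phi}_d\|_F \le 2n$ with probability 1. Eq. (3.2) follows from Eq. (3.9) and Markov's inequality. □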

4. Universally consistent vertex classification. The results in § 3 showed that by using the eigen-decomposition of $\mathbf{A}$, we can consistently estimate the truncated feature map $\Phi_d$ for any fixed, finite $d$ (up to an orthogonal transformation). In the subsequent discussion, we will often refer to the rows of the eigen-decomposition of $\mathbf{A}$, i.e., the rows of $\mathbf{U}_A \mathbf{S}_A^{1/2}$, as the estimated vectors. [30] showed that, for the dot product kernel on a finite-dimensional space $\mathcal{X}$, the $k$-nearest-neighbours classifier on $\mathbb{R}^d$ is universally consistent when we select the neighbors using the estimated vectors rather than the true but unknown latent positions. This result can be trivially extended to the setting for an arbitrary finite-rank kernel $\kappa$ as long as the feature map $\Phi$ of $\kappa$ is injective. It is also easy to see that if the feature map $\Phi$ is not injective then any classifier that uses only the estimated vectors (or the feature map $\Phi$) is no longer universally consistent. This section is concerned with the setting where the kernel $\kappa$ is an infinite-rank kernel with an injective feature map $\Phi$ onto $l_2$. Well-known examples of these kernels are the class of universal kernels [19, 24, 25].

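Definition 4.1. A continuous kernel $\kappa$ on some metric space $(\mathcal{X}, d)$ is a universal kernel if for some feature map $\Phi: \mathcal{X} \to H$ of $\kappa$ to some Hilbert space $H$, the class of functions of the form

$\mathcal{F}_\Phi = \{\langle w, \Phi(\cdot) \rangle_H : w \in H\}$

is dense in $\mathcal{C}(\mathcal{X})$, i.e., for any continuous function $g: \mathcal{X} \to \mathbb{R}$ and any $\epsilon > 0$, there exists an $f \in \mathcal{F}_\Phi$ such that $\|f - g\|_\infty < \epsilon$.

We note that if $\mathcal{F}_\Phi$ is dense in $\mathcal{C}(\mathcal{X})$ for some feature map $\Phi$ and $\Phi': \mathcal{X} \to H'$ is another feature map of $\kappa$, then $\mathcal{F}_{\Phi'}$ is also dense in $\mathcal{C}(\mathcal{X})$, i.e., the universality of $\kappa$ is independent of the choice for its feature map. Furthermore, every feature map of a universal kernel is injective.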

The following result lists several well-known universal kernels.

Proposition 4.2 ([19, 25]). Let $S$ be a compact subset of $\mathbb{R}^d$. Then the following kernels are universal on $S$.

• The exponential kernel $\kappa(x, y) = \exp(\langle x, y \rangle)$.

• The Gaussian kernel $\kappa(x, y) = \exp(-\|x - y\|^{2} / \sigma^{2})$ for all $\sigma > 0$.

• The binomial kernel $\kappa(x, y) = (1 - \langle x, y \rangle)^{-\alpha}$ for $\alpha > 0$.

• The inverse multiquadrics $\kappa(x, y) = (c^{2} + \|x - y\|^{2})^{-\beta}$ with $c > 0$ and $\beta > 0$.
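The four kernels listed above can be written down directly. The following is a minimal sketch with hypothetical function names and default parameter values chosen only for illustration; the binomial kernel is evaluated under the assumption that $\langle x, y \rangle < 1$ on the set of interest, and no rescaling to $[0, 1]$ is attempted.

```python
import math

def dot(x, y):
    """Euclidean inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def exponential_kernel(x, y):
    return math.exp(dot(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sq_dist(x, y) / sigma ** 2)

def binomial_kernel(x, y, alpha=1.0):
    # assumes <x, y> < 1, e.g. points inside the open unit ball
    return (1.0 - dot(x, y)) ** (-alpha)

def inverse_multiquadric(x, y, c=1.0, beta=1.0):
    return (c ** 2 + sq_dist(x, y)) ** (-beta)

x, y = (0.5, 0.0), (0.1, 0.3)
values = [k(x, y) for k in (exponential_kernel, gaussian_kernel,
                            binomial_kernel, inverse_multiquadric)]
```

Note that some of these kernels take values above 1, so they would need to be restricted or rescaled before being used as a link function $\kappa: \mathcal{X} \times \mathcal{X} \to [0, 1]$.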

If the kernel matrix $\mathbf{K}$ is known, then results on the universal consistency of support vector machines with universal kernels are available, see e.g., [26]. If the feature map $\Phi$ is known, then [4] showed that the $k$-nearest-neighbours on $\Phi_d$ is universally consistent as $k \to \infty$ and $d \to \infty$ where $k$ and $d$ are chosen using a structural risk minimization approach.

Our universally consistent classifier operates on the estimated vectors and is based on an empirical risk minimization approach. Namely, we will show that the classifier that minimizes a convex surrogate $\varphi$ for 0-1 loss from a class of linear classifiers $\mathcal{C}^{(d_n)}$ is universally consistent provided that the convex surrogate $\varphi$ satisfies some mild conditions and that the complexity of the class $\mathcal{C}^{(d_n)}$ grows in a controlled manner.

First, we will expand our framework to the classification setting. Let $\mathcal{X}$ be as in § 2 and let $F_{\mathcal{X},\mathcal{Y}}$ be a distribution on $\mathcal{X} \times \{-1, 1\}$. Let $(X_1, Y_1), \dots, (X_{n+1}, Y_{n+1}) \overset{\text{i.i.d.}}{\sim} F_{\mathcal{X},\mathcal{Y}}$ and let $\mathbf{K}$ and $\mathbf{A}$ be as in § 2. The $Y_i$ are the class labels for the vertices in the graph corresponding to the adjacency matrix $\mathbf{A}$.

We suppose that we observe only $\mathbf{A}$, the adjacency matrix, and $Y_1, \dots, Y_n$, the class labels for all but the last vertex. Our goal is to accurately classify this last vertex, so for convenience of notation we shall define $X := X_{n+1}$ and $Y := Y_{n+1}$. Let the rows of $\mathbf{U}_A \mathbf{S}_A^{1/2}$ be denoted by $\zeta_d(X_1), \dots, \zeta_d(X_{n+1})$ (even though the $X_i$ are unobserved/unknown). We want to find a classifier $h_n$ such that, for any distribution $F_{\mathcal{X},\mathcal{Y}}$,

$E[L_n] := E\big[P[h_n(\zeta_{d_n}(X)) \neq Y \mid (\zeta_{d_n}(X_1), Y_1), \dots, (\zeta_{d_n}(X_n), Y_n)]\big] \to P[h^*(X) \neq Y] =: L^*$

where $h^*$ is the Bayes-optimal classifier and $L^*$ is its associated Bayes-risk.

Let $\mathcal{C}^{(d)}$ be the class of linear classifiers using the truncated feature map $\Phi_d$ whose linear coefficients are normalized to have norm at most $d$, i.e., $g \in \mathcal{C}^{(d)}$ if and only if $g$ is of the form

(4.1)  $g(x) = \begin{cases} 1 & \text{if } \langle w, \Phi_d(x) \rangle > 0 \\ -1 & \text{if } \langle w, \Phi_d(x) \rangle \le 0 \end{cases}$

for some $w \in \mathbb{R}^d$ with $\|w\| \le d$. By the fact that $\mathcal{F}_\Phi = \{\langle w, \Phi(\cdot) \rangle : w \in l_2\}$ is dense in $\mathcal{C}(\mathcal{X})$, one can show that empirical risk minimization over the class $\mathcal{C}^{(d_n)}$ for any increasing and divergent sequence $(d_n)$ yields a universally consistent classifier.

We now describe a setup for empirical risk minimization over $\mathcal{C}^{(d)}$ for increasing $d$ where we use the estimated $\zeta_d$ in place of the $\Phi_d$. Let us write $L_n(w; \zeta_d)$ for the empirical error when using the $\zeta_d$, i.e.,

$L_n(w; \zeta_d) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\mathrm{sign}(\langle w, \zeta_d(X_i) \rangle) \neq Y_i\} \le \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Y_i \langle w, \zeta_d(X_i) \rangle \le 0\}.$

We want to show that minimization of $L_n(w; \zeta_{d_n})$ over the class $\mathcal{C}^{(d_n)}$ for increasing $(d_n)$ leads to a universally consistent classifier for our latent position graphs setting. However, the loss function $L(f) = P(\mathrm{sign}(f(X)) \neq Y)$ of a classifier $f$ as well as its empirical version $L_n(f)$ is based on the 0-1 loss, which is discontinuous at $f = 0$. This induces complications in relating $L_n(w; \zeta_d)$ to $L_n(w; \Phi_d)$. That is, the classifier obtained by minimizing the 0-1 loss using $\zeta$ might be very different from the classifier obtained by minimizing the 0-1 loss using $\Phi$.
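The empirical 0-1 error and its margin-based upper bound can be computed on a toy example. The data below are made up purely for illustration, and the sign convention follows Eq. (4.1), where an inner product of exactly zero is classified as $-1$.

```python
def inner(w, z):
    """Euclidean inner product <w, z>."""
    return sum(a * b for a, b in zip(w, z))

def predict(w, z):
    """Classifier of the form in Eq. (4.1): +1 if <w, z> > 0, otherwise -1."""
    return 1 if inner(w, z) > 0 else -1

def empirical_01_loss(w, Z, Y):
    """Fraction of points whose predicted label differs from the true label."""
    return sum(predict(w, z) != y for z, y in zip(Z, Y)) / len(Y)

def margin_upper_bound(w, Z, Y):
    """Fraction of points with non-positive margin Y_i <w, z_i>."""
    return sum(y * inner(w, z) <= 0 for z, y in zip(Z, Y)) / len(Y)

w = (1.0, 0.0)
Z = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]   # toy "estimated vectors"
Y = [1, 1, -1, -1]
loss = empirical_01_loss(w, Z, Y)      # 0.5: the second and fourth points are misclassified
bound = margin_upper_bound(w, Z, Y)    # 0.75: the third point has margin exactly zero
```

The third point shows why the inequality can be strict: its inner product is zero, so it is classified correctly as $-1$ but still counted by the margin indicator.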

To circumvent this issue, we will work with some convex loss function $\varphi$ that is a surrogate of the 0-1 loss. The notion of constructing classification algorithms that correspond to minimization of a convex surrogate for the 0-1 loss is a powerful one and [1, 18, 35] among others showed that one can obtain, under appropriate regularity conditions, Bayes-risk consistent classifiers in this manner. Let $\varphi: \mathbb{R} \to [0, \infty)$. We define the $\varphi$-risk of $f: \mathcal{X} \to \mathbb{R}$ by

$R_\varphi(f) = E\,\varphi(Y f(X)).$

Given some data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$, the empirical $\varphi$-risk of $f$ is defined as

$R_{\varphi,n}(f) = \frac{1}{n} \sum_{i=1}^{n} \varphi(Y_i f(X_i)).$

We will often write $R_{\varphi,n}(f)$ for the empirical $\varphi$-risk when the number of samples $(X_i, Y_i)$ in $\mathcal{D}_n$ is clear from the context. Let $w \in \mathbb{R}^d$, $\|w\| \le d$ index a linear classifier on $\mathcal{C}^{(d)}$. Denote by $R_\varphi(w; \Phi_d)$, $R_{\varphi,n}(w; \Phi_d)$, $R_\varphi(w; \zeta_d)$ and $R_{\varphi,n}(w; \zeta_d)$ the various quantities analogous to $L(w; \Phi_d)$, $L_n(w; \Phi_d)$, $L(w; \zeta_d)$ and $L_n(w; \zeta_d)$ for 0-1 loss defined previously. Let us also define $R^*_\varphi$ as the minimum $\varphi$-risk over all measurable functions $f: \mathcal{X} \to \mathbb{R}$.

In this paper, we will assume that the convex surrogate $\varphi: \mathbb{R} \to [0, \infty)$ is differentiable with $\varphi'(0) < 0$. This implies that $\varphi$ is classification-calibrated [1]. Examples of classification-calibrated loss functions are the exponential loss function $\varphi(x) = \exp(-x)$ in boosting, the logit function $\varphi(x) = \log_2(1 + \exp(-x))$ in logistic regression, and the square error loss $\varphi(x) = (1 - x)^2$. For classification-calibrated loss functions, we have the following result.

Theorem 4.3 ([1]). Let $\varphi: \mathbb{R} \to [0, \infty)$ be classification-calibrated. Then for any sequence of measurable functions $f_i: \mathcal{X} \to \mathbb{R}$ and every probability distribution $F_{\mathcal{X},\mathcal{Y}}$, $R_\varphi(f_i) \to R^*_\varphi$ implies $L(f_i) \to L^*$.

We now state the main result of this section, which is that empirical $\varphi$-risk minimization over the class $\mathcal{C}^{(d_n)}$ for some diverging sequence $(d_n)$ yields a universally consistent classifier for the latent position graphs setting.

Theorem 4.4. Let $\epsilon \in (0, 1/4)$ be fixed. For a given $d$, let $C_d = \max\{\varphi'(-d), \varphi'(d)\}$. Suppose that $d_n$ is given by the following rule

(4.2)

Let $\tilde{g}_n$ be the classifier obtained by empirical $\varphi$-risk minimization over the class $\mathcal{C}^{(d_n)}$. Then $R_\varphi(\tilde{g}_n) \to R^*_\varphi$ as $n \to \infty$ and hence $\mathrm{sign}(\tilde{g}_n)$ is universally consistent, i.e.,

$E\big[P(\mathrm{sign}(\tilde{g}_n(\zeta_{d_n}(X))) \neq Y \mid \mathcal{D}_n)\big] \to L^*$

as $n \to \infty$ for any distribution $F_{\mathcal{X},\mathcal{Y}}$.
Remark. We note that due to the use of the estimated $\zeta$ in place of the true $\Phi$, Theorem 4.4 is limited in two key aspects. The first is that we do not claim that $\tilde{g}_n$ is universally strongly consistent for any $F_{\mathcal{X},\mathcal{Y}}$ and the second is that we cannot specify $d_n$ in advance. In return, the minimization of the empirical $\varphi$-risk over the class $\mathcal{C}^{(d)}$ is a convex optimization problem and the solution can be obtained more readily than the minimization of empirical 0-1 loss. For example, by using squared error loss instead of 0-1 loss, the classifier that minimizes the empirical $\varphi$-risk can be viewed as a ridge regression problem. We note also that as the only accumulation point in the spectrum of $\mathcal{K}$ is at zero, the sequence $d_n$ as specified in Eq. (4.2) exists. Furthermore, such a sequence is only one possibility among many. In particular, the conclusion in Theorem 4.4 holds for any sequence $d_n$ that diverges and satisfies the condition in Eq. (4.7). Choosing the right $(d_n)$ requires balancing the approximation error $\inf_{g \in \mathcal{C}^{(d_n)}} R_\varphi(g) - R^*_\varphi$ and the estimation error $R_\varphi(\tilde{g}_n) - \inf_{g \in \mathcal{C}^{(d_n)}} R_\varphi(g)$, and this can be done using an approach based on structural risk minimization (see e.g. § 18.1 of [11] and [18]).
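The convex optimization problem mentioned in the remark can be sketched for the squared error loss $\varphi(x) = (1 - x)^2$ with the constraint $\|w\| \le d$. The projected gradient method, the synthetic features and labels, and all function names below are assumptions made for illustration; the paper does not prescribe a particular solver. The norm-constrained least-squares problem solved here is the constrained counterpart of ridge regression.

```python
import numpy as np

def empirical_risk(w, Z, Y):
    """Empirical phi-risk for the squared error loss phi(x) = (1 - x)^2."""
    return np.mean((1.0 - Y * (Z @ w)) ** 2)

def project_to_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def minimize_risk(Z, Y, radius, steps=500):
    """Projected gradient descent for the norm-constrained empirical risk."""
    n, d = Z.shape
    # the gradient of the risk is Lipschitz with constant 2 * lambda_max(Z^T Z / n)
    lipschitz = 2.0 * np.linalg.eigvalsh(Z.T @ Z / n)[-1]
    step = 1.0 / lipschitz
    w = np.zeros(d)
    for _ in range(steps):
        grad = -2.0 * (Z * Y[:, None]).T @ (1.0 - Y * (Z @ w)) / n
        w = project_to_ball(w - step * grad, radius)
    return w

rng = np.random.default_rng(3)
n, d = 200, 3
Z = rng.normal(size=(n, d))                    # stand-in for the estimated vectors
Y = np.where(Z[:, 0] > 0, 1.0, -1.0)           # synthetic labels (assumed)
w_hat = minimize_risk(Z, Y, radius=float(d))   # constraint ||w|| <= d
```

The step size $1/L$ makes each iteration non-increasing in the empirical risk, so the returned `w_hat` is never worse than the starting point $w = 0$.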

We now proceed to prove Theorem 4.4. A rough sketch of the argument goes as follows. First we show that any classifier $g$ using the estimated vectors $\zeta_d$ induces a classifier $g'$ using the true truncated feature map $\Phi_d$ such that the empirical $\varphi$-risk of $g$ is "close" to the empirical $\varphi$-risk of $g'$. Then by applying a Vapnik-Chervonenkis type bound for $g'$, we show that the classifier $\hat{g}$ induced by the classifier $\tilde{g}$ that was selected using empirical $\varphi$-risk minimization has $\varphi$-risk that is "close" to the minimum $\varphi$-risk for the classifiers in the class $\mathcal{C}^{(d)}$. Universal consistency of $\hat{g}$ and hence of $\tilde{g}$ follows by letting $d$ grow in a specified manner.

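Let $1 \le d \le n$. Let $\mathbf{U}_A \mathbf{S}_A^{1/2}$ be the embedding of $\mathbf{A}$ into $\mathbb{R}^d$. Let $\mathbf{W}_d \in \mathcal{M}_d(\mathbb{R})$ be an orthogonal matrix given by

$\mathbf{W}_d = \underset{\mathbf{W}:\ \mathbf{W}^{T}\mathbf{W} = \mathbf{I}}{\arg\min}\ \|\mathbf{U}_A \mathbf{S}_A^{1/2} \mathbf{W} - \mathbf{\Phi}_d\|_F.$

The following result states that if there is a gap in the spectrum of $\mathcal{K}$ at $\lambda_d(\mathcal{K})$, then $R_{\varphi,n}(w; \zeta_d)$ and $R_{\varphi,n}(\mathbf{W}_d w; \Phi_d)$ are close for all $w \in \mathbb{R}^d$, $\|w\| \le d$. That is, the empirical $\varphi$-risk of a linear classifier using $\zeta_d$ is not too different from the empirical $\varphi$-risk of a related classifier (the relationship is given by $\mathbf{W}_d$) using $\Phi_d$.

Proposition 4.5. Let $d \ge 1$ be such that $\lambda_d(\mathcal{K}) > \lambda_{d+1}(\mathcal{K})$ and let $C_d = \max\{\varphi'(d), \varphi'(-d)\}$. Then for any $w \in \mathbb{R}^d$, $\|w\| \le d$, we have, with probability at least $1 - 1/n^2$,

(4.3)  $|R_{\varphi,n}(w; \zeta_d) - R_{\varphi,n}(\mathbf{W}_d w; \Phi_d)| \le 27\,\delta_d^{-2}\, d\, C_d \sqrt{\frac{3 d \log n}{n}}.$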
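Proof. We have

$|R_{\varphi,n}(w; \zeta_d) - R_{\varphi,n}(\mathbf{W}_d w; \Phi_d)| \le \frac{1}{n} \sum_{i=1}^{n} \big|\varphi(Y_i \langle w, \zeta_d(X_i) \rangle) - \varphi(Y_i \langle \mathbf{W}_d w, \Phi_d(X_i) \rangle)\big|.$

Now $\varphi$ is convex and thus locally Lipschitz-continuous. Also, $|\langle w, \Phi_d(X) \rangle| \le d$ for all $X \in \mathcal{X}$. Hence, there exists a constant $M$ independent of $n$ and $F_{\mathcal{X},\mathcal{Y}}$ such that

$\big|\varphi(Y_i \langle w, \zeta_d(X_i) \rangle) - \varphi(Y_i \langle \mathbf{W}_d w, \Phi_d(X_i) \rangle)\big| \le M\, \|\zeta_d(X_i) - (\mathbf{W}_d)^{T} \Phi_d(X_i)\|$

for all $i$. Thus, by Theorem 3.1, we have,

(4.4)  $|R_{\varphi,n}(w; \zeta_d) - R_{\varphi,n}(\mathbf{W}_d w; \Phi_d)| \le \frac{M}{n} \sum_{i=1}^{n} \|\zeta_d(X_i) - (\mathbf{W}_d)^{T} \Phi_d(X_i)\|$
$\qquad \le \frac{M}{\sqrt{n}} \Big(\sum_{i=1}^{n} \|\zeta_d(X_i) - (\mathbf{W}_d)^{T} \Phi_d(X_i)\|^{2}\Big)^{1/2}$
$\qquad = \frac{M}{\sqrt{n}}\, \|\mathbf{U}_A \mathbf{S}_A^{1/2} - \mathbf{\Phi}_d \mathbf{W}_d\|_F$
$\qquad \le 27\,\delta_d^{-2}\, M \sqrt{\frac{3 d \log n}{n}}$

with probability at least $1 - 1/n^2$. Finally, by the mean-value theorem, we can take $M = d \max\{\varphi'(d), \varphi'(-d)\}$ to complete the proof. □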

The Vapnik-Chervonenkis theory for 0-1 loss function can also be extended to the convex surrogate setting $R_\varphi$. In particular, the following result provides a uniform deviation bound for $|R_\varphi(f) - R_{\varphi,n}(f)|$ for functions $f$ in some class $\mathcal{F}$ in terms of the VC-dimension of $\mathcal{F}$.

Lemma 4.6 ([18]). Let $\mathcal{F}$ be a class of functions with VC-dimension $V < \infty$. Suppose that the range of any $f \in \mathcal{F}$ is contained in the interval $[-d, d]$. Let $n \ge 5$. Then we have, with probability at least $1 - 1/n^2$,



(4.5) sup (/) - R^Af)\ < max{^'{dl ^'{-d)]^ / 3^1og n 
The following result combines Proposition 4.5 and Lemma 4.6 and shows that minimizing $R_{\varphi,n}(w; \zeta_d)$ over $w \in \mathbb{R}^d$, $\|w\| \le d$ leads to a classifier whose $\varphi$-risk is close to optimal in the class $\mathcal{C}^{(d)}$ with high probability.

Lemma 4.7. Let $d \ge 1$ be such that $\lambda_d(\mathcal{K}) > \lambda_{d+1}(\mathcal{K})$ and let $C_d = \max\{\varphi'(d), \varphi'(-d)\}$. Let $\tilde{w}_d$ minimize $R_{\varphi,n}(w; \zeta_d)$ over $w \in \mathbb{R}^d$, $\|w\| \le d$. Then with probability at least $1 - 2/n^2$,

(4.6)  $R_\varphi(\mathbf{W}_d \tilde{w}_d; \Phi_d) - \inf_{w \in \mathcal{C}^{(d)}} R_\varphi(w; \Phi_d) \le 74\,\delta_d^{-2}\, d\, C_d \sqrt{\frac{d \log n}{n}}.$

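Proof. For ease of notation, we let $e(n, d)$ be the term in the right hand side of Eq. (4.3) and let $C(n, d)$ be the term in the right hand side of Eq. (4.5). Also let $\bar{w}_d := \arg\inf_{w \in \mathcal{C}^{(d)}} R_\varphi(w; \Phi_d)$. We then have

$R_\varphi(\mathbf{W}_d \tilde{w}_d; \Phi_d) \le R_{\varphi,n}(\mathbf{W}_d \tilde{w}_d; \Phi_d) + C(n, d)$
$\qquad \le R_{\varphi,n}(\tilde{w}_d; \zeta_d) + e(n, d) + C(n, d)$
$\qquad \le R_{\varphi,n}((\mathbf{W}_d)^{T} \bar{w}_d; \zeta_d) + e(n, d) + C(n, d)$
$\qquad \le R_{\varphi,n}(\bar{w}_d; \Phi_d) + 2e(n, d) + C(n, d)$
$\qquad \le R_\varphi(\bar{w}_d; \Phi_d) + 2e(n, d) + 2C(n, d)$

with probability at least $1 - 2/n^2$. □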

Remark. Eq. (4.6) is a VC-type bound. The term $d^{3/2}\delta_d^{-2}$ in Eq. (4.6) can be viewed as contributing to the generalization error for the classifiers in $\mathcal{C}^{(d)}$. That is, because we are training using the estimated vectors in $\mathbb{R}^d$, the generalization error not only depends on the dimension of the embedded space, but also depends on how accurate the estimated vectors are in that space.

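We now have the necessary ingredients to prove the main result of this section.

Proof of Theorem 4.4. Let $(d_n)$ be a non-decreasing sequence of positive integers that diverges to $\infty$ and that

(4.7)  $\delta_{d_n}^{-2}\, d_n\, C_{d_n} \sqrt{\frac{d_n \log n}{n}} = o(1).$

By Lemma 4.7 and the Borel-Cantelli lemma, we have

$\lim_{n \to \infty} \Big[ R_\varphi(\mathbf{W}_{d_n} \tilde{w}_{d_n}; \Phi_{d_n}) - \inf_{w \in \mathcal{C}^{(d_n)}} R_\varphi(w; \Phi_{d_n}) \Big] = 0$

almost surely. As $(d_n)$ diverges, $\lim_{n \to \infty} \inf_{w \in \mathcal{C}^{(d_n)}} R_\varphi(w; \Phi_{d_n}) = R^*_\varphi$ by Proposition A.2. We therefore have

$\lim_{n \to \infty} R_\varphi(\mathbf{W}_{d_n} \tilde{w}_{d_n}; \Phi_{d_n}) = R^*_\varphi$

almost surely. Now fix an $n$. The empirical $\varphi$-risk minimization on $w \in \mathbb{R}^{d_n}$, $\|w\| \le d_n$ using the estimated vectors $\zeta_{d_n}$ gives us a classifier $\langle \tilde{w}_{d_n}, \zeta_{d_n} \rangle$. We now consider the difference $R_\varphi(\tilde{w}_{d_n}; \zeta_{d_n}) - R_\varphi(\mathbf{W}_{d_n} \tilde{w}_{d_n}; \Phi_{d_n})$. By a similar computation to that used in the derivation of Eq. (4.4), we have

$|R_\varphi(\tilde{w}_{d_n}; \zeta_{d_n}) - R_\varphi(\mathbf{W}_{d_n} \tilde{w}_{d_n}; \Phi_{d_n})| \le d_n C_{d_n}\, E\big[\|\zeta_{d_n}(X) - (\mathbf{W}_{d_n})^{T} \Phi_{d_n}(X)\|\big]$
$\qquad \le d_n C_{d_n} \sqrt{E\big[\|\zeta_{d_n}(X) - (\mathbf{W}_{d_n})^{T} \Phi_{d_n}(X)\|^{2}\big]}$
$\qquad \le 27\,\delta_{d_n}^{-2}\, d_n C_{d_n} \sqrt{\frac{6 d_n \log n}{n}} = o(1).$

We therefore have

$\lim_{n \to \infty} R_\varphi(\tilde{w}_{d_n}; \zeta_{d_n}) = R^*_\varphi.$

Thus, by Theorem 4.3, we have

$\lim_{n \to \infty} E[L_n(\tilde{w}_{d_n}; \zeta_{d_n})] = L^*.$

The only thing that remains is the use of $n^{-1}(\lambda_d(\mathbf{A}) - \lambda_{d+1}(\mathbf{A}))$ as an estimate for $\delta_d$. By Proposition 3.2 and Theorem 3.1, we have

(4.8)  $\sup_{d \ge 1} \big|\delta_d - n^{-1}(\lambda_d(\mathbf{A}) - \lambda_{d+1}(\mathbf{A}))\big| \le 10 \sqrt{\frac{\log(2n^{2})}{n}}$

with probability at least $1 - 2/n^2$. Thus, if $d_n$ satisfies Eq. (4.2), then, on the event in Eq. (4.8), Eq. (4.7) holds for $d_n \to \infty$. Finally, we note that as $n \to \infty$, there exists a sequence $d_n$ that satisfies Eq. (4.2) and diverges to $\infty$, as the only accumulation point in the spectrum of $\mathcal{K}$ is at zero. □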
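The plug-in estimate $n^{-1}(\lambda_d(\mathbf{A}) - \lambda_{d+1}(\mathbf{A}))$ of the spectral gap $\delta_d$ is straightforward to compute from the adjacency matrix alone. The sketch below is illustrative only: the simulated graph (uniform latent positions, Gaussian link function) and the function name `estimated_gaps` are assumptions, and the selection rule of Eq. (4.2) itself is not reproduced.

```python
import numpy as np

def estimated_gaps(A, max_d):
    """Return (lambda_d(A) - lambda_{d+1}(A)) / n for d = 1, ..., max_d."""
    n = A.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues in descending order
    return (eigvals[:max_d] - eigvals[1:max_d + 1]) / n

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(size=n)                          # assumed F = Uniform[0, 1]
K = np.exp(-(X[:, None] - X[None, :]) ** 2)      # assumed Gaussian link function
upper = np.triu(rng.uniform(size=(n, n)) < K, k=1)
A = (upper | upper.T).astype(float)              # observed adjacency matrix

gaps = estimated_gaps(A, max_d=5)                # gaps[d - 1] estimates delta_d
```

A data-driven choice of $d_n$ would then compare these estimated gaps against a threshold that shrinks with $n$, in the spirit of the rule referred to in Theorem 4.4.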

5. Conclusions. In this paper we investigated the problem of finding a universally consistent classifier for classifying the vertices of latent position graphs. We showed that if the link function $\kappa$ used in the construction of the graphs belongs to the class of universal kernels, then an empirical $\varphi$-risk minimization approach, i.e., minimizing a convex surrogate of the 0-1 loss over the class of linear classifiers in $\mathbb{R}^{d_n}$ for some sequence $d_n \to \infty$, yields universally consistent vertex classifiers.

We have presented the universally consistent classifiers in the setting where the graphs are on $n + 1$ vertices, there are $n$ labeled vertices, and the task is to classify the remaining unlabeled vertex. It is easy to see that in the case where there are only $m < n$ labeled vertices, the same procedure given in Theorem 4.4 with $n$ replaced by $m$ still yields universally consistent classifiers, provided that $m \to \infty$.

The bound for the generalization error of the classifiers described in § 4 is
of the form $O(n^{-1/2} \delta_d^{-1} \sqrt{d^3 \log n})$. This bound depends on both the subspace
projection error in § 3 as well as the generalization error of the class $\mathcal{C}^{(d)}$. It is
often the case that the bound on the generalization error of the class $\mathcal{C}^{(d)}$ can
be improved, as long as the classification problems satisfy a "low-noise" condition,
i.e., that the posterior probability $\eta(x) = P[Y = 1 \mid X = x]$ is bounded away
from 1/2. Results on fast convergence rates in low-noise conditions, e.g., [1, 5],
can thus be used, but as the subspace projection error is independent of the
low-noise condition, there might not be much improvement in the resulting error
bound.

Also related to the above issue is the choice of the sequence $\{d_n\}$. If more is
known about the kernel $\kappa$, then the choice for the sequence $\{d_n\}$ can be adjusted
accordingly. For example, good bounds for $\Lambda_k = \sum_{j > k} \lambda_j(\mathcal{K})$, the sum of the tail
eigenvalues of $\mathcal{K}$, along with bounds for the error between the truncated feature
map $\Phi_d$ and the feature map $\Phi$ from [6, 8, 23], can be used to select the sequence
$\{d_n\}$.

Finally, it is of potential interest to extend the results herein to sparse graphs,
graphs with attributes on the edges, latent position graphs with non-positive
definite link functions $\kappa$, and graphs with errorfully observed edges.

APPENDIX A: ADDITIONAL PROOFS. 

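Proof of Lemma 3.4. Let $\Psi_{r,n} \in \mathbb{R}^n$ be the vector whose entries are $\sqrt{\lambda_r}\,\psi_r(X_i)$ for
$i = 1, 2, \dots, n$ with $\lambda_r = \lambda_r(\mathcal{K})$. We note that $K = \sum_{r=1}^{\infty} \Psi_{r,n} \Psi_{r,n}^{\top}$. Let $u^{(1)}, \dots, u^{(d)}$ be
the eigenvectors associated with the $d$ largest eigenvalues of $K/n$. We have

\[ U_K S_K U_K^{\top} = \sum_{s=1}^{d} u^{(s)} (u^{(s)})^{\top} K = \sum_{s=1}^{d} \sum_{r=1}^{\infty} u^{(s)} (u^{(s)})^{\top} \Psi_{r,n} \Psi_{r,n}^{\top}. \]

The $ij$-th entry of $U_K S_K U_K^{\top}$ is then given by

\[ \sum_{s=1}^{d} \sum_{r=1}^{\infty} u_i^{(s)} \langle u^{(s)}, \Psi_{r,n} \rangle \sqrt{\lambda_r}\,\psi_r(X_j). \]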
Let $v^{(1)}, \dots, v^{(d)}$ be the extensions of $u^{(1)}, \dots, u^{(d)}$ as defined by Eq. (B.2). We then
have, for any $s = 1, 2, \dots, d$,




We thus have 

Now let ^^'\X] = XT=i{i>^'\4>rVJr)jfvi'\X]VJripr ^ ^- IS the embed- 

ding of the sequence {{v^''^, -rkr^\)r)yev^'\X))%^ e h into ^ (see Eq. (2.2)). By 
Eq. (A.1) and the definition of (Eq. (2.3)), the ii-t\i entry of can be 

written as 

d d 
s=\ r=\ s=l 

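We note that, by the reproducing kernel property of $\kappa(\cdot, x)$,

\[
\begin{aligned}
\sum_{r=1}^{\infty} \langle v^{(s)}, \psi_r \sqrt{\lambda_r} \rangle_{\mathcal{H}} \, v^{(s)}(X) \, \sqrt{\lambda_r}\,\psi_r
&= \sum_{r=1}^{\infty} \langle v^{(s)}, \psi_r \sqrt{\lambda_r} \rangle_{\mathcal{H}} \, \langle v^{(s)}, \kappa(\cdot, X) \rangle_{\mathcal{H}} \, \sqrt{\lambda_r}\,\psi_r \\
&= \langle v^{(s)}, \kappa(\cdot, X) \rangle_{\mathcal{H}} \sum_{r=1}^{\infty} \langle v^{(s)}, \psi_r \sqrt{\lambda_r} \rangle_{\mathcal{H}} \, \sqrt{\lambda_r}\,\psi_r \\
&= \langle v^{(s)}, \kappa(\cdot, X) \rangle_{\mathcal{H}} \, v^{(s)}.
\end{aligned}
\]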
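As the $v^{(s)}$ are orthogonal with respect to $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, the $ij$-th entry of $U_K S_K U_K^{\top}$ can also
be written as

\[
\begin{aligned}
\sum_{s=1}^{d} v^{(s)}(X_i)\, v^{(s)}(X_j)
&= \sum_{s=1}^{d} \langle v^{(s)}, \kappa(\cdot, X_i) \rangle_{\mathcal{H}} \, \langle v^{(s)}, \kappa(\cdot, X_j) \rangle_{\mathcal{H}} \\
&= \sum_{s=1}^{d} \sum_{s'=1}^{d} \langle v^{(s)}, \kappa(\cdot, X_i) \rangle_{\mathcal{H}} \, \langle v^{(s')}, \kappa(\cdot, X_j) \rangle_{\mathcal{H}} \, \langle v^{(s)}, v^{(s')} \rangle_{\mathcal{H}} \\
&= \Bigl\langle \sum_{s=1}^{d} \langle v^{(s)}, \kappa(\cdot, X_i) \rangle_{\mathcal{H}} \, v^{(s)}, \; \sum_{s=1}^{d} \langle v^{(s)}, \kappa(\cdot, X_j) \rangle_{\mathcal{H}} \, v^{(s)} \Bigr\rangle_{\mathcal{H}} \\
&= \langle \hat{\mathcal{P}}_d \kappa(\cdot, X_i), \hat{\mathcal{P}}_d \kappa(\cdot, X_j) \rangle_{\mathcal{H}}.
\end{aligned}
\]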

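As the $\hat{\mathcal{P}}_d \kappa(\cdot, X_i)$ lie in a $d$-dimensional subspace of $\mathcal{H}$, they can be isometrically
embedded into $\mathbb{R}^d$. Thus there exists a matrix $X \in \mathcal{M}_{n,d}(\mathbb{R})$ such that
$XX^{\top} = U_K S_K U_K^{\top}$ and that the rows of $X$ correspond to the projections $\hat{\mathcal{P}}_d \kappa(\cdot, X_i)$.
Therefore, there exists a unitary matrix $W \in \mathcal{M}_d(\mathbb{R})$ such that $X = U_K S_K^{1/2} W$ as
desired. □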

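Lemma A.1. Let $A$ and $B$ be $n \times n$ positive semidefinite matrices with $\operatorname{rank}(A) =
\operatorname{rank}(B) = d$. Let $X, Y \in \mathcal{M}_{n,d}(\mathbb{R})$ be of full column rank such that $XX^{\top} = A$ and
$YY^{\top} = B$. Let $\delta$ be the smallest non-zero eigenvalue of $B$. Then there exists an
orthogonal matrix $W \in \mathcal{M}_d(\mathbb{R})$ such that

\[ \| XW - Y \|_F \le \frac{\| A - B \| \bigl( \sqrt{d \|A\|} + \sqrt{d \|B\|} \bigr)}{\delta}. \tag{A.2} \]
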
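Proof. Let $R = A - B$. As $Y$ is of full column rank, $Y^{\top} Y$ is invertible and its
smallest eigenvalue is $\delta$. We then have

\[ Y = XX^{\top} Y (Y^{\top} Y)^{-1} - R Y (Y^{\top} Y)^{-1}. \]

Let $T = X^{\top} Y (Y^{\top} Y)^{-1}$. We then have

\[ T^{\top} T - I = (Y^{\top} Y)^{-1} Y^{\top} XX^{\top} Y (Y^{\top} Y)^{-1} - I = (Y^{\top} Y)^{-1} Y^{\top} R Y (Y^{\top} Y)^{-1}. \]

Therefore,

\[ -(Y^{\top} Y)^{-1} Y^{\top} \|R\| Y (Y^{\top} Y)^{-1} \preceq T^{\top} T - I \preceq (Y^{\top} Y)^{-1} Y^{\top} \|R\| Y (Y^{\top} Y)^{-1} \]

where $\preceq$ refers to the positive semi-definite ordering for matrices. We thus have

\[ \| T^{\top} T - I \|_F \le \|R\| \cdot \| (Y^{\top} Y)^{-1} \|_F \le \sqrt{d}\, \|R\| \cdot \| (Y^{\top} Y)^{-1} \| \le \frac{\sqrt{d}\, \|R\|}{\delta}. \]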
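Now let $W$ be the orthogonal matrix in the polar decomposition $T = W (T^{\top} T)^{1/2}$.
We then have

\[
\begin{aligned}
\| XW - Y \|_F &\le \| XW - XT \|_F + \| XT - Y \|_F \\
&\le \|X\| \cdot \| (T^{\top} T)^{1/2} - I \|_F + \|R\| \cdot \| Y (Y^{\top} Y)^{-1} \|_F \\
&\le \|X\| \cdot \| (T^{\top} T)^{1/2} - I \|_F + \|R\| \cdot \|Y\| \cdot \| (Y^{\top} Y)^{-1} \|_F.
\end{aligned}
\]

Now, $\| (T^{\top} T)^{1/2} - I \|_F \le \| T^{\top} T - I \|_F$. Indeed, since $|\sqrt{\lambda} - 1| \le |\lambda - 1|$ for all $\lambda \ge 0$,

\[ \| (T^{\top} T)^{1/2} - I \|_F^2 = \sum_{i=1}^{d} \bigl( \lambda_i((T^{\top} T)^{1/2}) - 1 \bigr)^2 \le \sum_{i=1}^{d} \bigl( \lambda_i(T^{\top} T) - 1 \bigr)^2 = \| T^{\top} T - I \|_F^2. \]

We thus have

\[ \| XW - Y \|_F \le ( \|X\| + \|Y\| ) \frac{\sqrt{d}\, \|R\|}{\delta} \]

and Eq. (A.2) follows, as $\|X\| = \sqrt{\|A\|}$ and $\|Y\| = \sqrt{\|B\|}$. □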
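
The construction in the proof of Lemma A.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `lemma_a1_check` and the random test matrices are illustrative choices and not part of the original argument. It builds $W$ from the polar decomposition of $T$ via an SVD and compares both sides of Eq. (A.2).

```python
import numpy as np

def lemma_a1_check(X, Y):
    """Build W as in the proof of Lemma A.1; return both sides of Eq. (A.2) and W."""
    d = X.shape[1]
    A, B = X @ X.T, Y @ Y.T
    gram = Y.T @ Y
    T = X.T @ Y @ np.linalg.inv(gram)        # T = X^T Y (Y^T Y)^{-1}, a d x d matrix
    U, _, Vt = np.linalg.svd(T)              # T = U diag(s) Vt
    W = U @ Vt                               # orthogonal factor in T = W (T^T T)^{1/2}
    delta = np.linalg.eigvalsh(gram)[0]      # smallest eigenvalue of Y^T Y
    spec = lambda M: np.linalg.norm(M, 2)    # spectral norm
    lhs = np.linalg.norm(X @ W - Y, "fro")
    rhs = spec(A - B) * (np.sqrt(d * spec(A)) + np.sqrt(d * spec(B))) / delta
    return lhs, rhs, W

# Hypothetical test data: X is a rotated, slightly perturbed copy of Y.
rng = np.random.default_rng(0)
n, d = 30, 3
Y = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # an arbitrary orthogonal matrix
X = (Y + 0.01 * rng.normal(size=(n, d))) @ Q
lhs, rhs, W = lemma_a1_check(X, Y)
print(lhs <= rhs)
```

Since $X$ and $Y$ differ essentially by the rotation $Q$, the recovered $W$ undoes that rotation up to the perturbation, which is the sense in which the lemma aligns two factorizations of nearby matrices.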

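Proposition A.2. Let $\kappa$ be a universal kernel on $\mathcal{X}$ and let $\Phi \colon \mathcal{X} \to \ell_2$ be a
feature map of $\kappa$. Let $\mathcal{C}^{(1)}, \mathcal{C}^{(2)}, \dots$ be the sequence of classes of classifiers of the form in
Eq. (4.1). Then

\[ \lim_{d \to \infty} \inf_{f \in \mathcal{C}^{(d)}} R_\varphi(f) = R_\varphi^*. \tag{A.3} \]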

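Proof. We note that this result is a slight variation of Lemma 1 in [18]. For
completeness, we sketch its proof here. Let $f^*$ be the function defined by

\[ f^*(x) = \min_{\alpha \in \mathbb{R}} \{ \eta(x) \varphi(\alpha) + (1 - \eta(x)) \varphi(-\alpha) \} \]

where $\eta(x) = P[Y = 1 \mid X = x]$. Then $R_\varphi^* = E[f^*(X)]$. Now, for a given $\rho \in [0, 1/2]$, let
$H_\rho = \{ x \colon |\eta(x) - 1/2| \ge \rho \}$ and let $\bar{H}_\rho$ be the complement of $H_\rho$. We consider the
decomposition

\[ R_\varphi^* = E[f^*(X) \mathbf{1}\{X \in H_\rho\}] + E[f^*(X) \mathbf{1}\{X \in \bar{H}_\rho\}]. \]

The restriction of $f^*$ to $\bar{H}_\rho$ is measurable with range $[-C_\rho, C_\rho]$ for some finite
constant $C_\rho > 0$. The set of functions $\{ \langle w, \Phi \rangle_{\mathcal{H}} \}$ is dense in $C(\mathcal{X})$ and hence also
dense in $L^1(\mathcal{X}, F_{\mathcal{X}})$. Thus, for any $\epsilon > 0$, there exists a $w$ such that

\[ E[f^*(X) \mathbf{1}\{X \in \bar{H}_\rho\}] - E[\langle w, \Phi(X) \rangle_{\mathcal{H}} \mathbf{1}\{X \in \bar{H}_\rho\}] < \epsilon. \]

Furthermore, $E[f^*(X) \mathbf{1}\{X \in H_\rho\}] \to 0$ as $\rho \to 1/2$. Indeed, $H_{1/2} = \{ x \colon \eta(x) \in
\{0, 1\} \}$ so we can select $\alpha$ so that $\varphi(\alpha) = 0$ if $\eta(x) = 1$ and $\varphi(-\alpha) = 0$ if $\eta(x) = 0$.
To complete the proof, we note that the classes $\mathcal{C}^{(d)}$ are nested, i.e., $\mathcal{C}^{(d)} \subset
\mathcal{C}^{(d+1)}$. Hence $\inf_{f \in \mathcal{C}^{(d)}} R_\varphi(f)$ is a decreasing sequence that converges to $R_\varphi^*$ as
desired. □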
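
To make the pointwise quantity $f^*(x)$ in the proof concrete, here is a small sketch that evaluates the minimal conditional $\varphi$-risk for one particular surrogate. The exponential loss $\varphi(\alpha) = e^{-\alpha}$ is an assumption made only for this illustration (the proposition itself is not tied to it), and the names `exp_loss`, `conditional_risk` and `f_star` are hypothetical. For this loss the minimum has the known closed form $2\sqrt{\eta(1-\eta)}$, which vanishes as $\eta$ approaches 0 or 1, mirroring the step where the contribution of $H_\rho$ disappears as $\rho \to 1/2$.

```python
import math

def exp_loss(a):
    # Exponential surrogate loss, assumed here only for illustration.
    return math.exp(-a)

def conditional_risk(eta, a, loss):
    # eta * phi(alpha) + (1 - eta) * phi(-alpha)
    return eta * loss(a) + (1.0 - eta) * loss(-a)

def f_star(eta, loss, lo=-30.0, hi=30.0, iters=200):
    """Minimize the convex conditional risk over alpha in [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if conditional_risk(eta, m1, loss) < conditional_risk(eta, m2, loss):
            hi = m2
        else:
            lo = m1
    return conditional_risk(eta, 0.5 * (lo + hi), loss)

for eta in (0.5, 0.7, 0.9, 0.999):
    numeric = f_star(eta, exp_loss)
    closed_form = 2.0 * math.sqrt(eta * (1.0 - eta))
    print(eta, round(numeric, 6), round(closed_form, 6))
```

The numeric and closed-form columns agree, and the value shrinks toward zero as $\eta$ moves toward 1.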
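APPENDIX B: SPECTRA OF INTEGRAL OPERATORS AND KERNEL MATRICES.

We can tie the spectrum and eigenvectors of $K$ to the spectrum and eigenfunctions
of $\mathcal{K}$ by constructing an extension operator $\mathcal{K}_{\mathcal{H},n}$ for $K$ and relating
the spectra of $\mathcal{K}$ to that of $\mathcal{K}_{\mathcal{H},n}$ [22]. Let $\mathcal{H}$ be the reproducing kernel Hilbert
space for $\kappa$. Let $\mathcal{K}_{\mathcal{H}} \colon \mathcal{H} \to \mathcal{H}$ and $\mathcal{K}_{\mathcal{H},n} \colon \mathcal{H} \to \mathcal{H}$ be the linear operators
defined by

\[
\begin{aligned}
\mathcal{K}_{\mathcal{H}} \eta &= \int_{\mathcal{X}} \langle \eta, \kappa(\cdot, x) \rangle_{\mathcal{H}} \, \kappa(\cdot, x) \, dF(x), \\
\mathcal{K}_{\mathcal{H},n} \eta &= \frac{1}{n} \sum_{i=1}^{n} \langle \eta, \kappa(\cdot, X_i) \rangle_{\mathcal{H}} \, \kappa(\cdot, X_i).
\end{aligned}
\]

The operators $\mathcal{K}_{\mathcal{H}}$ and $\mathcal{K}_{\mathcal{H},n}$ are defined on the same Hilbert space $\mathcal{H}$, in
contrast to $\mathcal{K}$ and $K$, which are defined on the different spaces $L^2(\mathcal{X}, F)$ and $\mathbb{R}^n$,
respectively. Thus, we can relate the spectra of $\mathcal{K}_{\mathcal{H}}$ and $\mathcal{K}_{\mathcal{H},n}$. Furthermore, we
can also relate the spectra of $\mathcal{K}$ and $\mathcal{K}_{\mathcal{H}}$ as well as the spectra of $K$ and $\mathcal{K}_{\mathcal{H},n}$,
therefore giving us a relationship between the spectra of $\mathcal{K}$ and $K$. A precise
statement of the relationships is contained in the following results.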

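Proposition B.1 ([22], [33]). The operators $\mathcal{K}_{\mathcal{H}}$ and $\mathcal{K}_{\mathcal{H},n}$ are positive,
self-adjoint operators and are of trace class with $\mathcal{K}_{\mathcal{H},n}$ being of finite rank. The
spectra of $\mathcal{K}$ and $\mathcal{K}_{\mathcal{H}}$ are contained in $[0, 1]$ and are the same, possibly up to the zero
eigenvalues. If $\lambda$ is a non-zero eigenvalue of $\mathcal{K}$ and $u$ and $v$ are associated
eigenfunctions of $\mathcal{K}$ and $\mathcal{K}_{\mathcal{H}}$, normalized to norm 1 in $L^2(\mathcal{X}, F)$ and $\mathcal{H}$, respectively,
then

\[ u(x) = \frac{v(x)}{\sqrt{\lambda}} \ \text{for } x \in \operatorname{supp}(F); \qquad v(x) = \frac{1}{\sqrt{\lambda}} \int_{\mathcal{X}} \kappa(x, x') \, u(x') \, dF(x'). \tag{B.1} \]

Similarly, the spectra of $K/n$ and $\mathcal{K}_{\mathcal{H},n}$ are contained in $[0, 1]$ and are the same,
possibly up to the zero eigenvalues. If $\lambda$ is a non-zero eigenvalue of $K/n$ and $u$ and $v$
are the corresponding eigenvector and eigenfunction of $K/n$ and $\mathcal{K}_{\mathcal{H},n}$, normalized
to norm 1 in $\mathbb{R}^n$ and $\mathcal{H}$, respectively, then

\[ u_i = \frac{v(X_i)}{\sqrt{\lambda n}}; \qquad v(\cdot) = \frac{1}{\sqrt{\lambda n}} \sum_{i=1}^{n} \kappa(\cdot, X_i) \, u_i. \tag{B.2} \]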

Eq. (B.2) in Proposition B.1 states that an eigenvector $u$ of $K/n$, which is only
defined for $X_1, X_2, \dots, X_n$, can be extended to an eigenfunction $v \colon \mathcal{X} \to \mathbb{R}$ of
$\mathcal{K}_{\mathcal{H},n}$ defined for all $x \in \mathcal{X}$, and furthermore, that $u_i = v(X_i)/\sqrt{\lambda n}$ for all $i = 1, 2, \dots, n$.
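
The extension in Eq. (B.2) can be illustrated with a short, dependency-free sketch. The Gaussian kernel, the sample size, the uniform latent positions and the helper names (`kappa`, `top_eigenpair`, `v`) are all assumptions made for this example only. Power iteration finds the leading eigenpair of $K/n$; the eigenvector is then extended to a function $v$ on the whole latent space, and the code checks that $v(X_i)/\sqrt{\lambda n}$ recovers $u_i$ and that $v$ has unit norm in $\mathcal{H}$.

```python
import math
import random

def kappa(x, y):
    # Gaussian kernel, chosen only for illustration; values lie in (0, 1].
    return math.exp(-(x - y) ** 2)

def top_eigenpair(M, iters=500):
    """Power iteration for the largest eigenpair of a symmetric PSD matrix."""
    n = len(M)
    u = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        u = [x / lam for x in w]
    return lam, u

random.seed(0)
n = 20
X = [random.random() for _ in range(n)]          # hypothetical latent positions
K_over_n = [[kappa(a, b) / n for b in X] for a in X]
lam, u = top_eigenpair(K_over_n)                 # eigenpair of K/n

def v(x):
    # Extension of u to any point x of the latent space, as in Eq. (B.2).
    return sum(kappa(x, X[i]) * u[i] for i in range(n)) / math.sqrt(lam * n)

# u_i should equal v(X_i) / sqrt(lam * n) at every sample point.
max_gap = max(abs(v(X[i]) / math.sqrt(lam * n) - u[i]) for i in range(n))
# Squared RKHS norm of v equals u^T K u / (lam * n).
rkhs_norm_sq = sum(u[i] * kappa(X[i], X[j]) * u[j]
                   for i in range(n) for j in range(n)) / (lam * n)
print(max_gap, rkhs_norm_sq)
```

Because `v` is defined for any `x`, it can also be evaluated at a point outside the sample, which is what makes the extension useful for out-of-sample vertices.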
Theorem B.2 ([22, 36]). Let $\tau > 0$ be arbitrary. Then with probability at least
$1 - 2e^{-\tau}$,

\[ \| \mathcal{K}_{\mathcal{H}} - \mathcal{K}_{\mathcal{H},n} \|_{HS} \le 2\sqrt{2} \sqrt{\frac{\tau}{n}} \tag{B.3} \]

where $\| \cdot \|_{HS}$ is the Hilbert-Schmidt norm. Let $\{\lambda_j\}$ be a decreasing enumeration
of the eigenvalues for $\mathcal{K}_{\mathcal{H}}$ and let $\{\hat{\lambda}_j\}$ be an extended decreasing enumeration of
the eigenvalues of $\mathcal{K}_{\mathcal{H},n}$, i.e., $\hat{\lambda}_j$ is either an eigenvalue of $\mathcal{K}_{\mathcal{H},n}$ or $\hat{\lambda}_j = 0$. Then the above
bound and a Lidskii theorem for infinite dimensional operators [16] yield

\[ \Bigl( \sum_{j \ge 1} (\lambda_j - \hat{\lambda}_j)^2 \Bigr)^{1/2} \le 2\sqrt{2} \sqrt{\frac{\tau}{n}} \tag{B.4} \]

with probability at least $1 - 2e^{-\tau}$. For a given $d \ge 1$ and $\tau > 0$, if the number $n$ of
samples $X_i \sim F$ satisfies

\[ 4\sqrt{2} \sqrt{\frac{\tau}{n}} < \lambda_d - \lambda_{d+1}, \]

then with probability greater than $1 - 2e^{-\tau}$,

\[ \| \mathcal{P}_d - \hat{\mathcal{P}}_d \|_{HS} \le \frac{2\sqrt{2} \sqrt{\tau}}{(\lambda_d - \lambda_{d+1}) \sqrt{n}} \tag{B.5} \]

where $\mathcal{P}_d$ is the projection onto the subspace spanned by the eigenfunctions
corresponding to the $d$ largest eigenvalues of $\mathcal{K}_{\mathcal{H}}$ and $\hat{\mathcal{P}}_d$ is the projection onto the
subspace spanned by the eigenfunctions corresponding to the $d$ largest eigenvalues
of $\mathcal{K}_{\mathcal{H},n}$.
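
As a rough illustration of how the sample-size condition in Theorem B.2 scales, the sketch below solves $4\sqrt{2}\sqrt{\tau/n} < \lambda_d - \lambda_{d+1}$ for $n$, which gives $n > 32\tau/(\lambda_d - \lambda_{d+1})^2$, and evaluates the right-hand side of Eq. (B.5). The function names and the numerical values of the eigengap and of $\tau$ are hypothetical and chosen only for the example.

```python
import math

def min_samples(gap, tau):
    """Smallest integer n with 4*sqrt(2)*sqrt(tau/n) < gap, i.e. n > 32*tau/gap**2."""
    return math.floor(32.0 * tau / gap ** 2) + 1

def projection_bound(gap, tau, n):
    """Right-hand side of Eq. (B.5): 2*sqrt(2)*sqrt(tau) / (gap * sqrt(n))."""
    return 2.0 * math.sqrt(2.0) * math.sqrt(tau) / (gap * math.sqrt(n))

gap = 0.05              # hypothetical eigengap lambda_d - lambda_{d+1}
tau = math.log(100.0)   # failure probability 2*exp(-tau) = 0.02
n_min = min_samples(gap, tau)
print(n_min, projection_bound(gap, tau, n_min))
```

Note that the sample-size condition is equivalent to requiring the right-hand side of Eq. (B.5) to be below 1/2, so at `n_min` the printed bound sits just under that value; halving the eigengap multiplies the required number of samples by four.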